For the final project of Exploratory Data Analysis I picked the Red Wine Quality dataset, available at https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv. The dataset contains 11 features plus one label, quality, which gives the quality score of the wine. The 11 features are chemical properties of the wine, and the quality score indicates how good the wine is and is assumed here to depend on those 11 features. The main task in this exploratory analysis is to determine whether some features matter more than others in determining quality, or whether they all matter equally.
This dataset is publicly available for research. The details are described in [Cortez et al., 2009]: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
# Load all the libraries used during the project
library(ggplot2)
library(grid)
library(foreign)
library(MASS)
library(reshape2)
library(dplyr)
library(gridExtra)
library(GGally)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
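The chunk that loaded the data is not shown, so here is a minimal sketch of how a listing like the one above could be produced. The helper name `load_wine` is hypothetical, and the inline sample holds only three rows and five columns copied from the listing so the sketch runs without network access. The name `data` in the commented usage is the data frame name that appears in the model calls later in this report.

```r
# Address quoted in the introduction of this report.
wine_url <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv"

# Hypothetical helper: read the wine CSV from a path, URL or connection.
load_wine <- function(path) {
  read.csv(path)
}

# Offline stand-in: three rows and a few columns copied from the listing above.
sample_csv <- 'X,fixed.acidity,volatile.acidity,alcohol,quality
1,7.4,0.7,9.4,5
2,7.8,0.88,9.8,5
3,7.8,0.76,9.8,5'
wine_sample <- load_wine(textConnection(sample_csv))
str(wine_sample)

# With network access, the full file would be loaded as:
# data <- load_wine(wine_url)
# str(data)
```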
There are 1599 records, and each record has 12 features plus a row index X. Of the 12, 11 are chemical properties of the wine. The 12th, quality, is the most important one because it is the score this analysis tries to explain from the 11 chemical values.
## Group.1 x
## 1 3 30
## 2 4 212
## 3 5 3405
## 4 6 3828
## 5 7 1393
## 6 8 144
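The x column in the table above appears to be the sum of the quality scores at each level rather than the number of wines (for example 30 = 10 wines with score 3). The sketch below checks this reading by rebuilding the quality vector from the per-level counts reported in the summary later in this report. Using `aggregate` with `FUN = sum` is my assumption about the original call, since the chunk itself is not shown.

```r
# Rebuild the quality scores from the per-level counts in the summary table:
# 10, 53, 681, 638, 199 and 18 wines for scores 3 to 8.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Summing the scores per level reproduces the x column shown above.
score_sums <- aggregate(quality, by = list(quality), FUN = sum)

# Counting the rows per level gives the number of wines instead.
score_counts <- aggregate(quality, by = list(quality), FUN = length)

print(score_sums)
print(score_counts)
```

If the goal is the number of wines per score, `FUN = length` (or simply `table(quality)`) is the more direct choice.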
quality can be treated as a categorical feature, as it takes only 6 values: 3, 4, 5, 6, 7 and 8. I will convert quality into an ordinal feature by creating a new feature, quality_cat.
## Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
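One way to build such an ordinal feature is an ordered factor. This is a sketch using only the first ten quality values from the structure listing. On the full data frame the same call would be applied to the whole quality column.

```r
# First ten quality scores, copied from the structure listing.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fix the levels to 3..8 so the level codes do not depend on which
# scores happen to be present in the sample.
quality_cat <- factor(quality, levels = 3:8, ordered = TRUE)

str(quality_cat)
```

Passing `levels = 3:8` explicitly keeps the ordering stable even when a subset of the data lacks some scores.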
Summary of the data after adding another ordinal feature
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality_cat
## Min. : 8.40 Min. :3.000 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53
## Median :10.20 Median :6.000 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
It is clearly visible that no wine received the highest score of 10 or the lowest score of 0; all wines score between 3 and 8.
Also, the number of records with a quality score of 3 or 8 is very low (10 wines with score 3, 18 with score 8).
I will use histograms to see the distribution of each variable.
Most of the distributions look roughly normal, which is a good thing for exploration, although we can observe some positive skew in each of them.
Only citric.acid and volatile.acidity do not look normally distributed, so I will plot them on a log scale.
After the log transformation, the distributions look closer to normal.
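A log-scale histogram can be sketched as follows. The data frame `wine_head` is a stand-in that holds only the first ten values from the structure listing, and the choice of `scale_x_log10()` with five bins is my assumption, not necessarily what the original chunk used. One thing to watch: citric.acid contains zeros (its minimum in the summary is 0), and log10(0) is -Inf, so those rows cannot be placed on a log axis.

```r
library(ggplot2)

# Stand-in data: first ten values copied from the structure listing.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Count the zero values that a log10 axis would not be able to show.
n_zero_citric <- sum(wine_head$citric.acid == 0)

# Histogram of volatile.acidity on a log10 x axis.
p_log <- ggplot(wine_head, aes(x = volatile.acidity)) +
  geom_histogram(bins = 5) +
  scale_x_log10()
```

For citric.acid, a shifted transform such as `log10(citric.acid + 0.01)` would keep the zero rows, at the cost of an arbitrary offset.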
In this article a wine maker mentions the importance of using sulfur dioxide in wine to protect it from bacteria (https://www.theguardian.com/science/2013/oct/25/science-magic-wine-making).
I want to see the ratio of total.sulfur.dioxide to free.sulfur.dioxide.
I also think it would be interesting to look at the ratio of quality to alcohol, because I want to see how the alcohol content is related to the quality of the wine.
Another new feature that I am interested in is the ratio of pH to fixed.acidity.
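The three ratio features can be derived with plain column arithmetic. This sketch uses the first ten rows from the structure listing as a stand-in for the full data frame. The new column names match the ones that appear in the summary that follows.

```r
# Stand-in data: first ten rows copied from the structure listing.
wine_head <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ratio of total to free sulfur dioxide.
wine_head$total_to_free_sulfur.dioxide <-
  wine_head$total.sulfur.dioxide / wine_head$free.sulfur.dioxide

# Ratio of quality score to alcohol content.
wine_head$quality_to_alcohol <- wine_head$quality / wine_head$alcohol

# Ratio of pH to fixed acidity.
wine_head$ph_to_fixed.acidity <- wine_head$pH / wine_head$fixed.acidity

summary(wine_head[, c("total_to_free_sulfur.dioxide",
                      "quality_to_alcohol",
                      "ph_to_fixed.acidity")])
```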
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality quality_cat total_to_free_sulfur.dioxide
## Min. :3.000 3: 10 Min. : 1.167
## 1st Qu.:5.000 4: 53 1st Qu.: 2.062
## Median :6.000 5:681 Median : 2.667
## Mean :5.636 6:638 Mean : 3.242
## 3rd Qu.:6.000 7:199 3rd Qu.: 3.857
## Max. :8.000 8: 18 Max. :44.000
## quality_to_alcohol ph_to_fixed.acidity
## Min. :0.2727 Min. :0.1872
## 1st Qu.:0.5051 1st Qu.:0.3500
## Median :0.5319 Median :0.4148
## Mean :0.5425 Mean :0.4168
## 3rd Qu.:0.5882 3rd Qu.:0.4778
## Max. :0.8163 Max. :0.8478
The quality_to_alcohol variable looks close to normally distributed, whereas total_to_free_sulfur.dioxide and ph_to_fixed.acidity look roughly normal but skewed. In particular, ph_to_fixed.acidity comes close to showing bimodal characteristics.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## total_to_free_sulfur.dioxide 0.02970992 0.053301531 0.06731722
## quality_to_alcohol 0.18314048 -0.274321170 0.16033826
## ph_to_fixed.acidity -0.93722637 0.250396704 -0.64641449
## residual.sugar chlorides
## fixed.acidity 0.114776724 0.093705186
## volatile.acidity 0.001917882 0.061297772
## citric.acid 0.143577162 0.203822914
## residual.sugar 1.000000000 0.055609535
## chlorides 0.055609535 1.000000000
## free.sulfur.dioxide 0.187048995 0.005562147
## total.sulfur.dioxide 0.203027882 0.047400468
## density 0.355283371 0.200632327
## pH -0.085652422 -0.265026131
## sulphates 0.005527121 0.371260481
## alcohol 0.042075437 -0.221140545
## quality 0.013731637 -0.128906560
## total_to_free_sulfur.dioxide 0.050020578 0.081575123
## quality_to_alcohol -0.011200242 0.029748382
## ph_to_fixed.acidity -0.103954519 -0.162932319
## free.sulfur.dioxide total.sulfur.dioxide
## fixed.acidity -0.153794193 -0.11318144
## volatile.acidity -0.010503827 0.07647000
## citric.acid -0.060978129 0.03553302
## residual.sugar 0.187048995 0.20302788
## chlorides 0.005562147 0.04740047
## free.sulfur.dioxide 1.000000000 0.66766645
## total.sulfur.dioxide 0.667666450 1.00000000
## density -0.021945831 0.07126948
## pH 0.070377499 -0.06649456
## sulphates 0.051657572 0.04294684
## alcohol -0.069408354 -0.20565394
## quality -0.050656057 -0.18510029
## total_to_free_sulfur.dioxide -0.217280165 0.33113276
## quality_to_alcohol -0.006408606 -0.04624174
## ph_to_fixed.acidity 0.131221303 0.06181842
## density pH sulphates
## fixed.acidity 0.66804729 -0.68297819 0.183005664
## volatile.acidity 0.02202623 0.23493729 -0.260986685
## citric.acid 0.36494718 -0.54190414 0.312770044
## residual.sugar 0.35528337 -0.08565242 0.005527121
## chlorides 0.20063233 -0.26502613 0.371260481
## free.sulfur.dioxide -0.02194583 0.07037750 0.051657572
## total.sulfur.dioxide 0.07126948 -0.06649456 0.042946836
## density 1.00000000 -0.34169933 0.148506412
## pH -0.34169933 1.00000000 -0.196647602
## sulphates 0.14850641 -0.19664760 1.000000000
## alcohol -0.49617977 0.20563251 0.093594750
## quality -0.17491923 -0.05773139 0.251397079
## total_to_free_sulfur.dioxide 0.13869704 -0.10862004 0.055982239
## quality_to_alcohol 0.19062360 -0.21605516 0.202131780
## ph_to_fixed.acidity -0.63278026 0.81098487 -0.178842832
## alcohol quality
## fixed.acidity -0.06166827 0.12405165
## volatile.acidity -0.20228803 -0.39055778
## citric.acid 0.10990325 0.22637251
## residual.sugar 0.04207544 0.01373164
## chlorides -0.22114054 -0.12890656
## free.sulfur.dioxide -0.06940835 -0.05065606
## total.sulfur.dioxide -0.20565394 -0.18510029
## density -0.49617977 -0.17491923
## pH 0.20563251 -0.05773139
## sulphates 0.09359475 0.25139708
## alcohol 1.00000000 0.47616632
## quality 0.47616632 1.00000000
## total_to_free_sulfur.dioxide -0.16457178 -0.12433622
## quality_to_alcohol -0.24518051 0.73088115
## ph_to_fixed.acidity 0.17252570 -0.09036474
## total_to_free_sulfur.dioxide
## fixed.acidity 0.02970992
## volatile.acidity 0.05330153
## citric.acid 0.06731722
## residual.sugar 0.05002058
## chlorides 0.08157512
## free.sulfur.dioxide -0.21728017
## total.sulfur.dioxide 0.33113276
## density 0.13869704
## pH -0.10862004
## sulphates 0.05598224
## alcohol -0.16457178
## quality -0.12433622
## total_to_free_sulfur.dioxide 1.00000000
## quality_to_alcohol -0.00381810
## ph_to_fixed.acidity -0.05261529
## quality_to_alcohol ph_to_fixed.acidity
## fixed.acidity 0.183140478 -0.93722637
## volatile.acidity -0.274321170 0.25039670
## citric.acid 0.160338259 -0.64641449
## residual.sugar -0.011200242 -0.10395452
## chlorides 0.029748382 -0.16293232
## free.sulfur.dioxide -0.006408606 0.13122130
## total.sulfur.dioxide -0.046241743 0.06181842
## density 0.190623598 -0.63278026
## pH -0.216055162 0.81098487
## sulphates 0.202131780 -0.17884283
## alcohol -0.245180513 0.17252570
## quality 0.730881147 -0.09036474
## total_to_free_sulfur.dioxide -0.003818100 -0.05261529
## quality_to_alcohol 1.000000000 -0.22354971
## ph_to_fixed.acidity -0.223549710 1.00000000
We can see in the table above that many features have a negative correlation with quality.
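To make the sign pattern easier to read, the quality column of the correlation table can be pulled out and sorted. The values below are copied from the table above. On the full data frame the same vector would come from something like `cor(data[sapply(data, is.numeric)])[, "quality"]`, which is my assumption about how the table was produced.

```r
# Correlation of each feature with quality, copied from the table above.
quality_cor <- c(
  fixed.acidity                = 0.12405165,
  volatile.acidity             = -0.39055778,
  citric.acid                  = 0.22637251,
  residual.sugar               = 0.01373164,
  chlorides                    = -0.12890656,
  free.sulfur.dioxide          = -0.05065606,
  total.sulfur.dioxide         = -0.18510029,
  density                      = -0.17491923,
  pH                           = -0.05773139,
  sulphates                    = 0.25139708,
  alcohol                      = 0.47616632,
  total_to_free_sulfur.dioxide = -0.12433622,
  quality_to_alcohol           = 0.73088115,
  ph_to_fixed.acidity          = -0.09036474
)

# Sort from most negative to most positive.
sorted_cor <- sort(quality_cor)

# Split the feature names by sign of the correlation.
negative_features <- names(quality_cor[quality_cor < 0])
positive_features <- names(quality_cor[quality_cor > 0])

print(round(sorted_cor, 3))
```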
Now we will use scatterplots to visualise the relationships between the positively correlated features.
The scatterplot for all the features together is unreadable, so I am splitting it into 2 scatterplots. I am also adding density and pH to the scatterplots despite the fact that they show a negative correlation.
A few noticeable points here are:
I would like to see the scatterplot for the non-quality features:
Now we will look at some of the interesting bivariate relationships:
Density and alcohol have a fairly strong negative correlation (-0.496).
The inverse relation is clearly visible in the graph, but at every level of alcohol there is still considerable variability in density.
Quality and alcohol have a fairly strong positive correlation (0.476):
The correlation is clearly visible in the graph: as the alcohol content increases, the quality gets better. The trend starts above an alcohol level of about 8.
This is an obvious relation, as quality is part of the ratio. Let's see if I can find anything interesting in it:
The only interesting fact I can get from the graph is that when the quality is low (3 or 4), the quality-to-alcohol ratio is lower as well. This is expected, since quality is the numerator of the ratio and is also correlated with alcohol content.
pH and citric.acid have a strong negative correlation (-0.542), so it would be interesting to see the graph for them.
As guessed, pH and citric.acid are inversely related. In fact, when citric.acid is 0 the pH is around 3.4. Another thing to notice is that at the same citric.acid level the pH still varies. This variation is visible only up to a citric.acid value of about 0.50. Beyond that, the pH does not vary much even as citric.acid increases.
Now we will look at some relationships between quality and other features.
There does not appear to be much of a relationship here: the median stays about the same across all quality levels.
We can see that higher quality wines have a lower quantity of volatile acids: the median volatile acidity decreases as the quality of the wine increases.
We cannot say anything conclusive about the total_to_free_sulfur.dioxide ratio, as it first increases slightly, then decreases, and increases again at the end.
It appears that the quantity of chlorides is lower in high quality wines, as the median of chlorides decreases as quality increases.
I will look further at the relation between density and alcohol, as they have shown a fairly strong negative correlation.
What we can see from the plot is that the quality of the wine increases as the alcohol content increases and the density decreases. Most of the high quality wines are in the lower right corner, which is the area of high alcohol and low density, and the low quality wines are on the upper left side, which is the high density and low alcohol zone.
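A plot of this kind can be sketched with ggplot2 by mapping quality to colour. The function name `plot_alcohol_density` is hypothetical, and the stand-in data holds only the first five rows from the structure listing (the listing prints just five density values, rounded to three decimals), so the resulting picture only illustrates the mechanics, not the trend described above.

```r
library(ggplot2)

# Stand-in data: first five rows copied from the structure listing.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  density = c(0.998, 0.997, 0.997, 0.998, 0.998),
  quality = c(5, 5, 5, 6, 5)
)

# Hypothetical helper: scatter alcohol against density, coloured by quality.
plot_alcohol_density <- function(df) {
  ggplot(df, aes(x = alcohol, y = density, colour = factor(quality))) +
    geom_point(alpha = 0.6) +
    labs(x = "alcohol", y = "density", colour = "quality")
}

p_scatter <- plot_alcohol_density(wine_head)
```

Converting quality with `factor()` gives one discrete colour per score instead of a continuous gradient, which makes the corners of the plot easier to compare.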
The next pair is citric.acid and fixed.acidity, which show a high positive correlation (0.672).
From the plot it is clearly visible that there is a positive correlation, which we already know. A few interesting points from the plot are:
Most of the high quality wines lie in the middle of the plot, where citric.acid and fixed.acidity are both mid-range. One clear point is that for the same amount of citric.acid, wine quality increases as fixed.acidity increases. This is clearly visible in the region where citric.acid is around 0.4 and fixed.acidity is between 7 and 11.
So we can see that fixed.acidity varies for a given amount of citric.acid.
One more interesting point is that most of the lowest quality wines lie in the lower left corner, where citric.acid is almost 0 and fixed.acidity has some value. So we can say that even with some fixed.acidity in the wine, the quality tends to remain poor if there is no citric.acid.
Now I would like to view the relation between sugar and alcohol. They do not show a very high correlation, but sugar and alcohol are important components, because during fermentation sugar breaks down into ethanol and so determines the amount of alcohol in the wine.
The trend of the plot shows that as the alcohol content and sugar increase, the quality increases. One interesting finding is that even at lower sugar content the quality increases if the alcohol content increases. And at low or medium alcohol content, the quality increases with the sugar content. This suggests that sugar raises the alcohol content and that overall wine quality improves with it, although the correlation between residual.sugar and quality in the table is only 0.014, so this trend is weak.
At the beginning I noted that only part of the quality range is used, i.e. 3 to 8, so we cannot explore the extreme minimum and maximum quality. And since the dataset is not very large, some quality levels have very few wines. I would like to make a new division of quality into good, medium and bad. This gives each category more samples. Any quality less than or equal to 5 is bad, 6 is medium, and 7 or above is good.
## bad med good
## 744 638 217
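The grouping can be expressed with `cut()`. As a check, this sketch rebuilds the quality vector from the per-level counts in the summary table and confirms that the three groups come out as 744, 638 and 217. The variable name `good_bad` matches the name used in the second model call later in this report. The exact call in the original chunk is not shown, so `cut()` is my assumption.

```r
# Rebuild the quality scores from the per-level counts in the summary table.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed: (2,5] is bad, (5,6] is medium, (6,8] is good.
good_bad <- cut(
  quality,
  breaks = c(2, 5, 6, 8),
  labels = c("bad", "med", "good"),
  ordered_result = TRUE
)

table(good_bad)
```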
Now I should see whether there are any new trends or relationships between the new quality category and the features, by replotting many of the plots and histograms.
Now the trends are clearer and easier to understand, and I can draw clearer conclusions from the box plots:
The following features correlate directly with quality:
Density is the only feature that is negatively correlated.
Residual.sugar shows no trend or correlation.
pH shows a different trend, as it goes down as quality increases from medium to good.
Now that we have seen the relationships among the measurable features and between quality and the other features, it is time to build a model. A model could be very useful for people working in this domain, who could use it to estimate wine quality from the different features.
Since all the features in this dataset are continuous and quality is an ordered category, we will use an ordinal regression model, polr from the MASS package.
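The model call can be sketched as below. The formula and `Hess = TRUE` are taken from the Call shown in the model output in this report. The wrapper name `fit_quality_model` is hypothetical, and the commented usage assumes `data` is the full data frame with quality_cat and the ratio feature already added.

```r
# Formula as it appears in the model output of this report.
quality_formula <- quality_cat ~ alcohol + density + pH +
  total_to_free_sulfur.dioxide + fixed.acidity + residual.sugar + sulphates

# Hypothetical wrapper around MASS::polr (proportional odds logistic regression).
# Hess = TRUE keeps the Hessian so that summary() can report standard errors.
fit_quality_model <- function(df, formula = quality_formula) {
  MASS::polr(formula, data = df, Hess = TRUE)
}

# Usage on the full data frame (not run here):
# model <- fit_quality_model(data)
# summary(model)
# confint(model)
```

The response must be an ordered factor for polr, which is why quality_cat is used rather than the integer quality column.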
## Call:
## polr(formula = quality_cat ~ alcohol + density + pH + total_to_free_sulfur.dioxide +
## fixed.acidity + residual.sugar + sulphates, data = data,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## alcohol 0.85491 0.05592 15.2876
## density -186.82488 0.90086 -207.3848
## pH 0.11300 0.46396 0.2435
## total_to_free_sulfur.dioxide -0.07392 0.02723 -2.7147
## fixed.acidity 0.29279 0.04071 7.1919
## residual.sugar 0.07127 0.03565 1.9991
## sulphates 2.85728 0.32187 8.8770
##
## Intercepts:
## Value Std. Error t value
## 3|4 -178.3472 0.9213 -193.5846
## 4|5 -176.4646 0.9203 -191.7492
## 5|6 -172.9772 0.9299 -186.0178
## 6|7 -170.2904 0.9470 -179.8237
## 7|8 -167.3402 0.9810 -170.5847
##
## Residual Deviance: 3220.23
## AIC: 3244.23
## 2.5 % 97.5 %
## alcohol 7.453030e-01 0.96451142
## density -1.885905e+02 -185.05922416
## pH -7.963562e-01 1.02235122
## total_to_free_sulfur.dioxide -1.272809e-01 -0.02055001
## fixed.acidity 2.129965e-01 0.37258095
## residual.sugar 1.394123e-03 0.14115387
## sulphates 2.226417e+00 3.48814179
Now I want to build a model based on the new quality category.
## Call:
## polr(formula = good_bad ~ alcohol + density + pH + total_to_free_sulfur.dioxide +
## fixed.acidity + residual.sugar + sulphates, data = data,
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## alcohol 0.85810 0.05716 15.012
## density -211.68919 0.93100 -227.378
## pH 0.65998 0.48096 1.372
## total_to_free_sulfur.dioxide -0.08830 0.02928 -3.016
## fixed.acidity 0.34663 0.04196 8.262
## residual.sugar 0.08882 0.03644 2.437
## sulphates 2.94682 0.32617 9.035
##
## Intercepts:
## Value Std. Error t value
## bad|med -195.3877 0.9521 -205.2078
## med|good -192.6781 0.9700 -198.6440
##
## Residual Deviance: 2587.617
## AIC: 2605.617
## 2.5 % 97.5 %
## alcohol 0.74606678 0.97013087
## density -213.51391942 -209.86445975
## pH -0.28268162 1.60263734
## total_to_free_sulfur.dioxide -0.14568254 -0.03091439
## fixed.acidity 0.26439834 0.42886835
## residual.sugar 0.01739777 0.16024191
## sulphates 2.30754359 3.58608841
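Both models appear to fit the data reasonably well, and most of the features have t-values that are large in magnitude. pH is the exception (t-values of 0.24 and 1.37), and its confidence interval includes zero in both models.
Among the positive coefficients, alcohol has the highest t-value (about 15 in both models), which is understandable because alcohol had the strongest correlation with quality among the original features. Density has by far the largest t-value in magnitude (-207.38 and -227.38).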
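The polr summary reports t-values but no p-values. A common rough check, sketched here under a normal approximation, converts the t-values from the first model's coefficient table into two-sided p-values. The 0.05 threshold is my own choice for illustration.

```r
# t-values copied from the coefficient table of the first model.
t_values <- c(
  alcohol                      = 15.2876,
  density                      = -207.3848,
  pH                           = 0.2435,
  total_to_free_sulfur.dioxide = -2.7147,
  fixed.acidity                = 7.1919,
  residual.sugar               = 1.9991,
  sulphates                    = 8.8770
)

# Two-sided p-values under a standard normal approximation.
p_values <- 2 * pnorm(abs(t_values), lower.tail = FALSE)

# Features whose p-value exceeds the (assumed) 0.05 threshold.
not_significant <- names(p_values[p_values > 0.05])

print(signif(p_values, 3))
print(not_significant)
```

This is consistent with the confidence intervals above, where pH is the only coefficient whose interval includes zero.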
The model has some limitations. To start with, quality takes only some values (3 to 8) in the data and leaves out the others, so the model would fail if the quality lies outside this range. The other drawback is that the model is trained on a certain type of wine making technique. There might exist other techniques that use different features, or the same features in different quantities.
In this section we will summarise the EDA using 3 useful plots and present a summary of the overall analysis and of the individual plots.
This plot shows that, in general, high quality wines have a high alcohol content and a low density, which is visible in the bottom right quarter, and lower quality wines have a high density and a low alcohol content.
When I change the wine quality to the new categories good, medium and bad, we can see clearer trends in the quality of the wines.
Density does not change much when we move from bad to good quality; there is only a slight decrease.
With alcohol the trend is very clear: it increases steadily with quality.
With citric.acid there is a slight increase when moving from bad to medium quality wine, but a large increase when going from medium to good quality.
This plot gives some insight into wine making: sugar and alcohol are both important, and when alcohol is low, a higher sugar content is also associated with better quality. According to what I read (https://en.wikipedia.org/wiki/Sugars_in_wine), yeast ferments sugar and converts it into ethanol. This also suggests something important for wine makers (I hope they already know this): the balance between sugar and alcohol matters. If a wine already has enough alcohol and some extra sugar is added, the total alcohol content may become too high, and the quality will not be as expected even though each amount seemed right on its own.
Initially this looked like an easy EDA task, but the large number of features and the complicated chemical relationships among them made it a little difficult. Since my domain knowledge of chemistry is very limited, I searched the internet about wine making and about how one chemical feature of wine affects the others. For example, sugar produces alcohol, so there is a need to strike the right balance between sugar and alcohol. Likewise, several acids are used in wine, and they need to be used in the right amounts, otherwise the wine quality will degrade. Sulfur dioxide is also very important during bottling, as it kills the bacteria that make the wine sour. But controlling the amount of sulfur dioxide is important too, as it may make the taste bitter.
The good part of this exploration was that many of the findings were in fact consistent with general wine making technique, such as the relationship between density and alcohol, the relationship between sugar and alcohol, and the role of acid quantities. The regression model that I created was not perfect, but most of the features show t-values that are large in magnitude.
Future work with this exploration would be to apply machine learning algorithms to generate models superior to the regression model built here. The dataset seems small for applying machine learning techniques, and more data is needed for training and testing to build a good model. Also, for better prediction the dataset should cover the whole quality range, whereas this dataset covers only the range from 3 to 8.
Another thing that could be done is to use the model to see whether the trends observed here carry over to other wine making techniques, for example at some other place or for making white wine.